You are viewing the RapidMiner Studio documentation for version 10.2.
LDA (Operator Toolbox)
Synopsis
This operator finds topics using the LDA method.
Description
LDA (Latent Dirichlet Allocation) is a method which allows you to identify topics in documents. This implementation of LDA uses the ParallelTopicModel of the Mallet library (source: Newman, Asuncion, Smyth and Welling, Distributed Algorithms for Topic Models JMLR (2009)) with SparseLDA sampling scheme and data structure (source: Yao, Mimno and McCallum, Efficient Methods for Topic Model Inference on Streaming Document Collections, KDD (2009)).
LDA provides topic diagnostics in the model object. For details on the measures see: http://mallet.cs.umass.edu/diagnostics.php. Note that some of the measures depend on the number of top words.
LDA uses Gibbs Sampling for the application of the model. The method exposes additional parameters in the application.
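For intuition about what Gibbs sampling does here, the sketch below implements a minimal collapsed Gibbs sampler for LDA. This is an illustration only, not the SparseLDA scheme the operator actually uses: each token's topic is repeatedly resampled from its full conditional distribution given all other assignments.

```python
import random
from collections import defaultdict

def gibbs_lda(docs, k, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Minimal collapsed Gibbs sampler for LDA (illustration only)."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})          # vocabulary size
    # Random initial topic assignment for every token.
    z = [[rng.randrange(k) for _ in d] for d in docs]
    ndk = [[0] * k for _ in docs]                  # doc -> topic counts
    nkw = [defaultdict(int) for _ in range(k)]     # topic -> word counts
    nk = [0] * k                                   # tokens per topic
    for di, d in enumerate(docs):
        for wi, w in enumerate(d):
            t = z[di][wi]
            ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
    for _ in range(iters):
        for di, d in enumerate(docs):
            for wi, w in enumerate(d):
                t = z[di][wi]                      # remove current assignment
                ndk[di][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                # Full conditional p(z = j | all other assignments).
                weights = [(ndk[di][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + V * beta)
                           for j in range(k)]
                t = rng.choices(range(k), weights=weights)[0]
                z[di][wi] = t                      # resample
                ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return ndk, nkw
```

After enough iterations, the normalized doc-topic counts `ndk` approximate the per-document topic confidences; the additional application parameters below control how many sampling rounds are run and which of them are used.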
Input
- col (Collection)
A preprocessed collection of documents.
Output
- exa (Data Table)
An ExampleSet with added "documentId" and "TopicId" attributes, and an additional attribute showing the confidence that the document belongs to the topic.
- top (Data Table)
An ExampleSet with details on the topics. For each topic the operator returns the top 5 most frequently used words (see the top words per topic parameter).
- mod
The topic model. It can be applied to a new collection of documents using Apply Model (Documents).
- per (Averagable)
The LogLikelihood value of the fit which can be used for optimization.
Parameters
- number_of_topics Number of topics to search for.
- use_alpha_heuristics If this parameter is set to true, alpha is set automatically. The heuristic used is: 50 / number of topics.
- alpha_sum Bayesian prior on the topic distribution.
- use_beta_heuristics If this parameter is set to true, beta is set automatically. The heuristic used is: 50 / number of words.
- beta Bayesian prior on the word distribution.
- optimize_hyperparameters If this parameter is set to true, both alpha and beta are optimized every k-th step. k can be provided by the "optimize interval for hyperparameters" parameter.
- optimize_interval_for_hyperparameters Frequency of hyperparameter optimization.
- top_words_per_topic Number of words used to describe each topic.
- iterations Number of iterations for optimization.
- reproducible If this parameter is set to true, parallel execution is deactivated. Results may differ between runs if this is left unchecked.
- enable_logging If this parameter is set to true, additional output is written to the Log panel.
- use_local_random_seed This parameter indicates whether a local random seed should be used.
- local_random_seed If the use local random seed parameter is checked, this parameter determines the local random seed.
- include_meta_data If checked, available meta information of the text (e.g. filename, date) is added as attributes.
- LDA.iterations (Application) Number of iterations for Gibbs sampling. Available in Apply Model (Documents).
- LDA.burnin (Application) Ignore the first x rounds of the sampling. Should be smaller than LDA.iterations. Available in Apply Model (Documents).
- LDA.thinning (Application) Only use every x-th iteration to determine the confidence. Available in Apply Model (Documents).
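One plausible reading of how the three application parameters interact can be sketched as follows. Note that this is an assumption about the semantics, not the documented Mallet/RapidMiner behavior: the sampler runs for LDA.iterations rounds, the first LDA.burnin rounds are discarded, and every LDA.thinning-th of the remaining rounds contributes to the averaged confidences.

```python
def sampling_schedule(iterations, burnin, thinning):
    """Indices of Gibbs sampling rounds whose samples would contribute
    to the averaged topic confidences, under the assumed semantics."""
    return [i for i in range(iterations)
            if i >= burnin and (i - burnin) % thinning == 0]

# Example: 100 iterations, first 20 discarded, every 10th round kept.
schedule = sampling_schedule(iterations=100, burnin=20, thinning=10)
print(schedule)  # -> [20, 30, 40, 50, 60, 70, 80, 90]
```

Averaging over several well-separated samples rather than taking a single final state reduces the variance of the reported confidences.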
Tutorial Processes
A simple application on lorem ipsum
This sample process is a minimalist example for LDA. It generates a collection of documents based on Lorem Ipsum, processes them using the Text Processing extension, and feeds them into the LDA operator.